Mapping of Sequence Reads to the Reference Genomes ◾ 65
information. Most aligners can perform both exact and inexact matching; inexact
matching is essential for locating reads that contain base call errors or that
differ genetically from the reference genome. Different aligners implement
different algorithms to perform both kinds of lookup in the indexed reference genome,
which is stored in data structures such as suffix trees, suffix arrays, hash tables, and the BWT. While the
exact lookup is straightforward, inexact matching uses sequence similarity to find the
most likely locations where a read originated. Although there are different ways to measure
sequence similarity, most aligners use the Hamming distance [12] or the Levenshtein distance
[13] to score the similarity between a read and portions of the reference genome
based on a threshold. Some aligners use the seed-and-extend strategy, which extends a seed (an
exactly matched substring) across mismatched bases so that reads with
base call errors or variations can still be mapped. Most aligners apply the seed-and-extend strategy
together with local sequence alignment using the Smith-Waterman (SW) algorithm. Seeds are created by
generating overlapping k-mers (substrings or words of length k) from the reference genome sequence.
Some aligners, such as Novoalign [14] and SOAP [15], index the k-mers in a trie or hash table for
quick search.
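The ideas in this paragraph can be illustrated with a minimal Python sketch. It is not the implementation of any particular aligner: the function names (hamming, levenshtein, build_kmer_index, map_read), the toy sequences, and the parameter values are all assumptions made for illustration. For brevity, the "extend" step here verifies each seed hit with a simple Hamming-distance threshold rather than with SW local alignment, so it tolerates mismatches but not insertions or deletions.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # match or substitute
        prev = curr
    return prev[-1]


def build_kmer_index(reference: str, k: int) -> dict:
    """Hash table mapping every overlapping k-mer to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read: str, reference: str, index: dict, k: int, max_mismatches: int):
    """Seed with exact k-mer hits, then verify each candidate by Hamming distance."""
    hits = {}
    seen = set()  # candidate start positions that were already verified
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            start = pos - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference) or start in seen:
                continue
            seen.add(start)
            distance = hamming(read, reference[start:start + len(read)])
            if distance <= max_mismatches:  # threshold on dissimilarity
                hits[start] = distance
    # Best (fewest mismatches) candidates first.
    return sorted(hits.items(), key=lambda item: (item[1], item[0]))


# Toy data (made up): the read matches position 8 with one substituted base.
reference = "ACGTACGTTAGCCGATAGGCTA"
read = "TAGCCGTTAGG"
index = build_kmer_index(reference, k=4)
candidates = map_read(read, reference, index, k=4, max_mismatches=2)
# candidates == [(8, 1)]: start position 8, one mismatch
```

In this toy run, several seeds also point to start position 1, but that candidate is rejected because it exceeds the mismatch threshold. A real aligner would typically replace the Hamming check with a gapped alignment so that reads containing indels can also be placed.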
2.3.1 SAM and BAM File Formats
Almost all read alignment programs (aligners) store the alignment information of the reads
mapped to the reference genome in a Sequence Alignment/Map (SAM) file or a Binary
Alignment/Map (BAM) file, which is the binary form of SAM. The SAM file is a human-readable
plain text file for storing biological sequences, mainly those aligned to a reference sequence
[16], but it can also contain unmapped reads. It is a TAB-delimited text file consisting of
two main sections: (i) a header section and (ii) an alignment section.
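As a rough illustration of this two-section layout, the following Python sketch separates header lines from alignment lines and splits each on TAB characters. The function name parse_sam and the two example records are hypothetical. The 11 mandatory column names and the 0x4 "unmapped" flag bit are taken from the SAM specification rather than from this passage, and the sketch does no validation beyond what is shown.

```python
# Names of the 11 mandatory alignment columns, as defined in the SAM specification.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam(text: str):
    """Split SAM text into header lines and alignment records (hypothetical helper)."""
    header, alignments = [], []
    for line in text.splitlines():
        if not line:
            continue
        cols = line.split("\t")  # SAM is TAB-delimited
        if line.startswith("@"):
            header.append(cols)  # header line, e.g. ["@SQ", "SN:ref", "LN:45"]
        else:
            record = dict(zip(SAM_FIELDS, cols[:11]))
            record["FLAG"] = int(record["FLAG"])
            record["POS"] = int(record["POS"])
            # Bit 0x4 of FLAG marks a read that is not mapped to the reference.
            record["unmapped"] = bool(record["FLAG"] & 0x4)
            record["tags"] = cols[11:]  # optional fields, if any
            alignments.append(record)
    return header, alignments


# Made-up example: two header lines, one mapped read, and one unmapped read.
example = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:ref\tLN:45\n"
    "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*\n"
    "r002\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t*\n"
)
header, alignments = parse_sam(example)
for rec in alignments:
    print(rec["QNAME"], rec["RNAME"], rec["POS"], "unmapped" if rec["unmapped"] else "mapped")
```

BAM files carry the same information in compressed binary form, so they cannot be read line by line like this; in practice a dedicated library or tool is used for both formats.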
The header section of the SAM file is optional; when present, it must come before
the alignment section. Each line in the header section must start with the “@” symbol followed
FIGURE 2.14 A header section of a SAM file.